In this notebook you can find a descriptive analysis of the energy consumption of a family and a forecast of the consumption for future months. The goal of the project is to “sell” the submetering devices by showing customers how they can benefit from the data the devices collect. So, on the one hand, we show what useful information they can obtain from “real-time” and aggregated data coming from the devices; on the other hand, we show how we can forecast the energy consumption using past data.
# We set the options for the Rmarkdown
knitr::opts_chunk$set( warning = FALSE)
# We load the libraries
if (!require("pacman")) {
  install.packages("pacman")
}
pacman::p_load(plotly, corrplot, lubridate, hms, Hmisc, imputeTS, rstudioapi,
               kknn, ggplot2, randomForest, dplyr, shiny, tidyr, caret, forecast)
We look at the dataset structure.
data <- read.delim("household_power_consumption.txt",header = TRUE, sep=";")
head(data)
We put the variables in the right type.
data$Global_reactive_power = suppressWarnings(as.numeric(as.character(data$Global_reactive_power)))
data$Global_active_power = suppressWarnings(as.numeric(as.character(data$Global_active_power)))
data$Sub_metering_1 = suppressWarnings(as.numeric(as.character(data$Sub_metering_1)))
data$Sub_metering_2= suppressWarnings(as.numeric(as.character(data$Sub_metering_2)))
data$Global_intensity = suppressWarnings(as.numeric(as.character(data$Global_intensity)))
data$Voltage = suppressWarnings(as.numeric(as.character(data$Voltage)))
We change the unit of measure of the active and reactive power so that it matches the one of the submetering columns (watt-hours). We then create a column joining the date and the time. Finally, we adjust the time to take daylight saving time into account.
# Change the unit of measure to watt-hours
data$Global_active_power = data$Global_active_power * 1000 / 60
data$Global_reactive_power = data$Global_reactive_power * 1000 / 60
# Create a DateTime column and parse it with strptime
data<-cbind(data,paste(data$Date,data$Time), stringsAsFactors=FALSE)
colnames(data)[10] <-"DateTime"
data<- data[,c(ncol(data), 1:(ncol(data)-1))]
data$DateTime <- strptime(data$DateTime, "%d/%m/%Y %H:%M:%S", tz="GMT")
data$Date <- NULL
data$Time <-NULL
# Adjust for daylight saving time: within each year's DST period
# the clock must be shifted one hour forward
dst = list(c("2007-03-25 02:00:00", "2007-10-28 01:59:00"),
           c("2008-03-30 02:00:00", "2008-10-26 01:59:00"),
           c("2009-03-29 02:00:00", "2009-10-25 01:59:00"),
           c("2010-03-28 02:00:00", "2010-10-31 01:59:00"))
for (p in dst) {
  a = which(data$DateTime == as.POSIXct(p[1], tz = "GMT") |
              data$DateTime == as.POSIXct(p[2], tz = "GMT"))
  data[a[1]:a[2], 1] = data[a[1]:a[2], 1] + 3600
}
# Recreate the date column and reorder the columns
data$Date <- as.Date(data$DateTime)
data=data[,c(1,9,2:8)]
We look at some summary measures of the variables.
summary(data)
DateTime Date Global_active_power
Min. :2006-12-16 17:24:00 Min. :2006-12-16 Min. : 1.267
1st Qu.:2007-12-12 00:18:30 1st Qu.:2007-12-12 1st Qu.: 5.133
Median :2008-12-06 07:13:00 Median :2008-12-06 Median : 10.033
Mean :2008-12-06 07:48:33 Mean :2008-12-05 Mean : 18.194
3rd Qu.:2009-12-01 14:07:30 3rd Qu.:2009-12-01 3rd Qu.: 25.467
Max. :2010-11-26 21:02:00 Max. :2010-11-26 Max. :185.367
NA's :25979
Global_reactive_power Voltage Global_intensity Sub_metering_1
Min. : 0.000 Min. :223.2 Min. : 0.200 Min. : 0.000
1st Qu.: 0.800 1st Qu.:239.0 1st Qu.: 1.400 1st Qu.: 0.000
Median : 1.667 Median :241.0 Median : 2.600 Median : 0.000
Mean : 2.062 Mean :240.8 Mean : 4.628 Mean : 1.122
3rd Qu.: 3.233 3rd Qu.:242.9 3rd Qu.: 6.400 3rd Qu.: 0.000
Max. :23.167 Max. :254.2 Max. :48.400 Max. :88.000
NA's :25979 NA's :25979 NA's :25979 NA's :25979
Sub_metering_2 Sub_metering_3
Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000
Median : 0.000 Median : 1.000
Mean : 1.299 Mean : 6.458
3rd Qu.: 1.000 3rd Qu.:17.000
Max. :80.000 Max. :31.000
NA's :25979 NA's :25979
All the numerical variables have the same number of missing values. Moreover, we have checked that they are missing in the same rows. We take a look at how they are distributed, in particular at their daily count.
# Create a table with the number of missing values per day
DATA=data[which(is.na(data$Global_reactive_power) == TRUE),]
t = table(DATA$Date)
print(t)
2006-12-21 2006-12-30 2007-01-14 2007-01-28 2007-02-22 2007-03-25 2007-04-28
2 2 1 1 2 1 1359
2007-04-29 2007-04-30 2007-06-01 2007-06-06 2007-06-09 2007-06-19 2007-06-29
1440 924 1 1 38 2 1
2007-07-15 2007-07-22 2007-08-01 2007-08-24 2007-09-26 2007-10-23 2007-11-21
130 1 21 1 2 2 1
2007-11-29 2007-12-17 2008-01-13 2008-02-02 2008-02-23 2008-03-24 2008-05-16
1 1 1 1 2 1 2
2008-06-13 2008-07-13 2008-08-04 2008-08-31 2008-10-25 2008-11-10 2008-11-12
1 2 1 1 43 6 2
2008-11-23 2008-12-10 2008-12-20 2009-01-14 2009-02-01 2009-02-14 2009-02-17
1 70 1 1 38 2 24
2009-03-01 2009-03-16 2009-04-13 2009-05-10 2009-05-26 2009-06-13 2009-06-14
1 1 2 1 3 1350 1440
2009-06-15 2009-07-10 2009-08-13 2009-09-13 2009-09-30 2009-10-11 2009-11-09
515 4 891 1 2 1 1
2009-12-10 2010-01-02 2010-01-12 2010-01-13 2010-01-14 2010-01-23 2010-02-10
2 1 547 1440 1142 1 1
2010-02-14 2010-03-20 2010-03-21 2010-04-11 2010-05-13 2010-06-12 2010-06-29
1 1208 819 1 1 1 1
2010-07-15 2010-08-17 2010-08-18 2010-08-19 2010-08-20 2010-08-21 2010-08-22
1 118 1440 1440 1440 1440 1348
2010-09-25 2010-09-26 2010-09-27 2010-09-28 2010-10-24
1144 1440 1440 1213 1
We see that there are consecutive days where all the minutes have null values. We can suppose that during that time the family was away and had turned off the electricity, so we substitute these NAs with 0, because that represents the real “energy consumption” during those days. There are also days with very few null values; we can suppose these are due to a temporary malfunction of the meter, and we substitute these NAs with the previous non-null value.
# I create a new dataset where I am going to apply the substitution.
mydata = data
# Days with many missing minutes (>= 178): the family was presumably away,
# so we will substitute the null values with 0
l = row.names(t)[t >= 178]
a = which( (as.character(mydata$Date) %in% l) & is.na(mydata$Global_active_power) )
mydata[a,3:9] = c(0,0,0,0,0,0,0)
# For the remaining days we substitute the null values with the previous non-null value
for (i in 3:9)
{
mydata[,i] = na_locf(mydata[,i], option='locf')
}
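As a quick sanity check (a sketch, assuming the mydata data frame built in the chunks above), we can verify that the two substitutions have removed every missing value:

```r
# Count the remaining NAs per column; after the imputation we expect all zeros
colSums(is.na(mydata))

# Stop with an error if any NA slipped through
stopifnot(!anyNA(mydata))
```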
We create a new column with the energy that is used in the house but not by the devices connected to submetering 1, 2 and 3. In this way we have divided the total electricity used by the family into four different sources.
#create a column with energy not used in submetering 1,2,3
mydata$other = as.numeric(mydata$Global_active_power-mydata$Sub_metering_1-mydata$Sub_metering_2-mydata$Sub_metering_3)
We made different plots to understand what useful information can be extracted from the devices using different time aggregations. Here in the notebook you can see some examples.
We look at the energy consumption per minute in the different submeterings. We would like to show how it is possible to identify single electrical appliances: every appliance has a specific pattern in its energy consumption. This information can be used to check whether there are malfunctions or whether a device is “too old” and is consuming more energy than expected. For example, in the following graph you can see the energy consumption of the dishwasher. Thanks to this level of detail in the data, we are able to calculate the energy used for washing the dishes.
dayconsidered = subset( mydata, day(Date)==8 & month(Date)==10 & year(Date) == 2010)
g =ggplot(dayconsidered[c(1200:1440),],aes(x = c(1200:1440), y = Sub_metering_1))+ geom_point()+geom_line()+
ggtitle(paste('Date:', weekdays(dayconsidered$Date[1]), dayconsidered$Date[1]), "dishwasher")+xlab('minutes')+
ylab('Sub metering 1')+theme_minimal()
print(g)
If we look at submetering two, we can see a pattern that keeps repeating in the energy consumption. This is related to the fridge: this appliance draws energy at regular intervals. Interestingly, if the family monitors the energy in real time, it is possible to know immediately whether the fridge has remained open for too long. This represents another useful application of the submetering devices.
dayconsidered = subset( mydata, day(Date)==8 & month(Date)==10 & year(Date) == 2010)
g =ggplot(dayconsidered,aes(x = c(1:1440), y = Sub_metering_2))+ geom_point()+geom_line()+
ggtitle(paste('Date:', weekdays(dayconsidered$Date[1]), dayconsidered$Date[1]), "fridge")+xlab('minutes')+
ylab('Sub metering 2')+theme_minimal()
print(g)
If we look at the energy that is not tracked by the submetering devices, we can see that it is more difficult to capture the behaviour of single appliances. Nevertheless, it is possible to check when the peaks of energy occur during the day, and this is useful information to verify that nothing unexpected has happened.
dayconsidered = subset( mydata, day(Date)==8 & month(Date)==10 & year(Date) == 2010)
g =ggplot(dayconsidered,aes(x = c(1:1440), y = other))+ geom_point()+geom_line()+
ggtitle(paste('Date:', weekdays(dayconsidered$Date[1]), dayconsidered$Date[1]))+xlab('minutes')+
ylab('Other consumption')+theme_minimal()
print(g)
The energy consumption aggregated by hour is an indication of when the family is more “active” at home. The family can use this information to distribute its energy consumption better and save money, since electricity can have a different price depending on the hour of the day.
# First we aggregate the energy by hour.
mydatah= mydata[,2:ncol(mydata)]
mydatah$hour = format( mydata$DateTime, format="%H")
mydatahours = data.frame(mydatah %>% group_by(Date, hour ) %>% summarise_all(sum))
# We plot the energy consumption in one specific day
dayconsidered = subset(mydatahours, day(Date)== 5& month(Date)==3 & year(Date) == 2010)
g =ggplot(dayconsidered,aes(x = c(1:24), y = Global_active_power))+ geom_point()+geom_line()+
ggtitle(paste('Date:', weekdays(dayconsidered$Date[1]), dayconsidered$Date[1]))+xlab("hour")+theme_minimal()
print(g)
We aggregate the energy used in a month and look at how the energy consumption changes across the months.
# We aggregate the data
mydatamonth = mydata
mydatamonth$monthyear = format(mydata$Date, "%Y-%m")
mydatamonth$DateTime = NULL
mydatamonth$Date = NULL
mydatamonths = mydatamonth %>% group_by(monthyear) %>% summarize_all(sum)
mydatamonths$year = as.factor(substring(mydatamonths$monthyear,1,4))
mydatamonths$month = as.factor(substring(mydatamonths$monthyear,6,8))
First we look at the global energy used in the different months. It is immediately clear that the energy consumption changes with the seasons: the peak is reached during winter and the trough during summer.
ggplot(data=mydatamonths[c(2:48),], aes(x=month, y= Global_active_power, group= year)) +
geom_line(aes(color= year))+
geom_point()+
theme(legend.position="top")+labs(y= "Global active energy", x = "Month", color=' Years')+theme_minimal()
We can then look at how much energy is used during the year by the different groups of appliances connected to the different submetering devices.
# First we group the energy by year
mydatayears = data.frame(mydatamonths[,2: (ncol(mydatamonths)-1)] %>% group_by(year) %>% summarize_all(sum))
# Then we can plot the pie chart for one year.
ds <- data.frame(labels = c("Sub-metering 1", "Sub-metering 2", "Sub-metering 3", "Other energy used"), values = unlist(mydatayears[4,c(6,7,8,9)]))
plot_ly(ds, labels = ~labels, values = ~values) %>%add_pie() %>% layout(title = "2009")
Finally, we can see how the electricity consumption in the different submeterings changes during the year.
plot_ly(mydatamonths %>% filter(year==2009), x = ~month, y = ~Sub_metering_1, type = 'bar', name = 'Sub metering 1') %>%
add_trace(y = ~Sub_metering_2, name = 'Sub metering 2') %>%
add_trace(y = ~Sub_metering_3, name = 'Sub metering 3') %>%
add_trace(y =~other, name = 'Other energy used') %>%
layout(yaxis = list(title = 'Energy'),xaxis = list(title = 'Month'), title = "2009" , barmode = 'stack')
When we have to forecast, we use a different approach to the null values. Since the holidays of the family were not always in the same period, we decide to fill those missing values using the usual energy consumption in a similar period (instead of substituting them with 0). For shorter periods of missing values, we still use the previous non-null value.
energy = data
energy$Voltage <-NULL
energy$Global_intensity <-NULL
# As before, we look at the number of null values per day
DATA=data[which(is.na(data$Global_reactive_power) == TRUE),]
t = table(DATA$Date)
# In the days with few missing values (< 178) we use the previous non-null value
mv = row.names(t)[t < 178]
b = which( (as.character(energy$Date) %in% mv) )
for(i in 3:7)
{
energy[b,i] = na_locf(energy[b,i], option='locf')
}
# For the rest of the missing values, we consider all the energy values registered in the different years in the same season, weekday and hour, and substitute a random value drawn from them
energy$season = 'winter'
energy$season[month(energy$DateTime)>2 & month(energy$DateTime) <6 ]='spring'
energy$season[month(energy$DateTime)>5 & month(energy$DateTime) <9 ]='summer'
energy$season[month(energy$DateTime)>8 & month(energy$DateTime) <12 ]='autumn'
energy$season = as.character(energy$season)
energy$weekdays = weekdays(energy$DateTime)
ll= unique(energy$season)
l = unique(energy$weekdays)
for (i in ll){
for (j in 0:23){
for (k in 1:length(l))
{
index = which( hour(energy$DateTime) == j & as.character(energy$weekdays)== l[k] & as.character(energy$season)==i )
for (m in 3:7)
{energy[index,m] = as.numeric(impute(energy[index,m],'random'))}
}}}
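The same kind of check can be repeated here (a sketch, assuming the energy data frame built above): after the LOCF and random imputations, the energy columns should contain no missing values.

```r
# Columns 3 to 7 hold the energy variables that were imputed above
stopifnot(!anyNA(energy[, 3:7]))
summary(energy[, 3:7])
```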
After having examined the data at different granularities (days, weekdays, etc.), we have decided to concentrate on predicting the energy by month, since there is a clear seasonality at this level. In the following code, we show the forecasts we obtained for the global energy using three well-known methods for time-series forecasting.
We create a time-series object and then the training and test sets. As training we consider the first three years of data, and as test the last eleven months of the last year.
# Create a time series object
energymonths = data.frame(energy %>% mutate(monthyear = format(mydata$Date, "%Y-%m")) %>% select(monthyear, Global_active_power) %>% group_by(monthyear) %>% summarize_all(sum))
energymonths = ts(energymonths[c(2:47),2], start=2007, frequency = 12)
# Create the training and the test
training = subset(energymonths, start = 1 ,end = 36)
testing= subset(energymonths, start= 37, end= 47)
We start by fitting a linear model with trend and seasonal components to the training series.
fit.lm = tslm( training ~ trend + season )
summary(fit.lm)
Call:
tslm(formula = training ~ trend + season)
Residuals:
Min 1Q Median 3Q Max
-216633 -31502 -10045 40466 136684
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1111755 51366 21.644 < 2e-16 ***
trend -1269 1413 -0.898 0.37855
season2 -226824 67852 -3.343 0.00282 **
season3 -154044 67896 -2.269 0.03298 *
season4 -329264 67970 -4.844 6.87e-05 ***
season5 -340739 68073 -5.006 4.60e-05 ***
season6 -452072 68205 -6.628 9.21e-07 ***
season7 -571338 68365 -8.357 2.01e-08 ***
season8 -663568 68555 -9.679 1.41e-09 ***
season9 -378292 68773 -5.501 1.36e-05 ***
season10 -243289 69020 -3.525 0.00181 **
season11 -132970 69294 -1.919 0.06749 .
season12 -23120 69596 -0.332 0.74275
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 83080 on 23 degrees of freedom
Multiple R-squared: 0.8979, Adjusted R-squared: 0.8446
F-statistic: 16.85 on 12 and 23 DF, p-value: 1.285e-08
checkresiduals(fit.lm)
Breusch-Godfrey test for serial correlation of order up to 16
data: Residuals from Linear regression model
LM test = 25.385, df = 16, p-value = 0.06332
a1 = accuracy(as.numeric(testing), forecast( fit.lm, h = 11)$mean)
print(a1)
ME RMSE MAE MPE MAPE ACF1 Theil's U
Test set -41279.06 63813.05 55923.34 -7.064748 8.6574 -0.1021445 0.3305876
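To see how the linear model tracks the held-out months, one option (a sketch using the autoplot and autolayer helpers of the forecast package, assuming fit.lm and testing from the chunks above) is to plot the forecast against the observed test values:

```r
# Plot the 11-month forecast of the linear model together with the test data
fc.lm <- forecast(fit.lm, h = 11)
autoplot(fc.lm) +
  autolayer(testing, series = "Observed") +
  labs(x = "Year", y = "Global active energy") +
  theme_minimal()
```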
Next we look at the exponential smoothing method (Holt-Winters, fitted here with ets). In the following output we can see the results obtained on the training set, in particular the values of the smoothing parameters alpha and gamma.
fit.ets= ets(training)
summary(fit.ets)
ETS(A,N,A)
Call:
ets(y = training)
Smoothing parameters:
alpha = 1e-04
gamma = 1e-04
Initial states:
l = 800332.3301
s = 280211.8 166541.2 39432.12 -88416.77 -407853.3 -256676
-133867.9 -39077.34 15980.62 123496.1 36135.54 264093.8
sigma: 92936.28
AIC AICc BIC
964.9337 988.9337 988.6865
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -5019.717 72651.63 51052.51 -2.748668 8.835276 0.5768878 -0.02822495
We check the residuals with the Ljung-Box test.
checkresiduals(fit.ets)
Ljung-Box test
data: Residuals from ETS(A,N,A)
Q* = 27.239, df = 3, p-value = 5.245e-06
Model df: 14. Total lags used: 17
We look at the error on the test set.
a1= accuracy(as.numeric(testing), forecast(fit.ets,h = 11)$mean)
print(a1)
ME RMSE MAE MPE MAPE ACF1 Theil's U
Test set -8791.1 54109.66 43909.72 -2.124202 6.242481 -0.1966048 0.2220527
We do a similar study with an ARIMA model. First we look at the fitted parameters.
fit.arima = auto.arima(training)
summary(fit.arima)
Series: training
ARIMA(0,0,0)(1,1,0)[12]
Coefficients:
sar1
-0.7563
s.e. 0.1237
sigma^2 estimated as 7.377e+09: log likelihood=-311.29
AIC=626.59 AICc=627.16 BIC=628.94
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -14099.41 68650.95 42701.77 -3.83786 7.618383 0.4825253 0.03684593
We study the residuals.
checkresiduals(fit.arima)
Ljung-Box test
data: Residuals from ARIMA(0,0,0)(1,1,0)[12]
Q* = 6.6713, df = 6, p-value = 0.3523
Model df: 1. Total lags used: 7
We look at the error on the test set.
a2= accuracy(as.numeric(testing), forecast(fit.arima,h = 11)$mean)
print(a2)
ME RMSE MAE MPE MAPE ACF1 Theil's U
Test set -16873.03 76176.98 55563.79 -6.479249 11.14715 -0.2831091 0.215031
Comparing the three models, we can see that …
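To make the comparison concrete, one possibility (a sketch assuming fit.lm, fit.ets, fit.arima and testing from the chunks above) is to collect the test-set accuracy measures of the three models in a single table:

```r
# Gather the test-set accuracy of each fitted model in one matrix
models <- list(linear = fit.lm, ets = fit.ets, arima = fit.arima)
acc <- t(sapply(models, function(m)
  accuracy(as.numeric(testing), forecast(m, h = 11)$mean)["Test set", ]))
print(round(acc, 2))
```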